Abstract

Differential calculus is used routinely across the sciences, albeit with
differences in notation. These differences are especially apparent when
working in higher dimensions with higher orders of derivatives. This article
scrutinises an efficient coordinate-free notation, hoping to facilitate its
broader adoption. Tensor products, whose purpose has been considered
difficult to motivate quickly in elementary ways, are purposely shown to
arise naturally in this context.
1 Introduction

The derivative of a function and the chain-rule formula for differentiating the 
composition of two functions are generally considered elementary because they 
are taught early on to students. Nevertheless, a plethora of articles exist on 
the chain-rule alone, including [Brf51H HHSlHBj . with [H] pointing out an error 
in a well-regarded book [3, p. 3]. The existence of differentiable yet nowhere
monotone functions [3. , while true, is far from obvious. The history is not 
straightforward either; Faà di Bruno was neither the first to state nor the
first to prove the higher-order chain-rule formula that bears his name [10].
The present article elucidates that even constructing a convenient notation
is not entirely elementary. It propounds a minor yet simplifying modification of
the "$Df$" notation for Fréchet derivatives [11, Chapter 8] in certain situations.

The $Df$ notation is not prevalent in applied fields that favour instead
gradients, Jacobians and Hessians [13]. This is somewhat surprising since
Section 2 exemplifies the convenience of the $Df$ notation over element-wise
differentiation.

Subtleties arise when differentiating abstract expressions such as
$D^3(f \circ g)$. Section 4 highlights that the safe approach is tedious while
the common alternative of omitting variables requires care. The cause of
notational difficulty is that higher-order derivatives evaluated at a point are
multi-linear maps whose arguments can again be multi-linear maps, resulting in a
tree structure with functions within functions. The tensor product suggests
itself as a way of collapsing the tree to a linear structure by converting
multi-linear maps to linear maps.

The details of how to use tensor products to simplify working with Fréchet
derivatives are not readily found in the literature. No mention is made in
the following textbooks on differential calculus [1, Chapter 2], [11, Chapter
8] , [201 Chapter 4] , [5TJ Chapter 5] , nor in the following textbooks on differential 
geometry Chapter 1], gl Chapter 1.2], [12, Chapter 1.3], [TS]. 

The tensor product can be difficult to motivate early in an undergraduate
curriculum due to a shortage of elementary contexts genuinely requiring a tensor
product. Introducing it as a part of calculus changes this; students initially
treat $\otimes$ as a formal symbol separating the arguments of a function and
become familiar with its product-like behaviour, then later appreciate that it
reduces multi-linear maps to linear maps.

2 An Example in Matrix Space 

The Df notation provides a coordinate-free approach to differential calculus in 
matrix spaces. It is presented here by way of example. 

Consider $f(X) = \operatorname{tr}\{X^T A X\}$ where $\operatorname{tr}\{\cdot\}$
denotes trace, superscript $T$ denotes transpose, and $A$ and $X$ are matrices
of compatible dimensions. Often the derivative of such a function $f$ is
represented by its Jacobian matrix whose $ij$-th element is the partial
derivative of $f$ with respect to the element $X_{ij}$ of $X$. Evaluating these
partial derivatives from first principles is straightforward but tedious: use
$(AB)_{ij} = \sum_k A_{ik} B_{kj}$ twice and $\operatorname{tr}\{Z\} = \sum_i Z_{ii}$
to obtain $f(X) = \sum_{i,j,k} X_{ji} A_{jk} X_{ki}$, differentiate normally, and
attempt to convert the answer back to matrix form.

The following is an alternative approach. Explanations follow in subsequent 
sections. Fix a matrix Z of the same dimensions as X. Then: 

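\begin{align}
f(X + tZ) - f(X) &= \operatorname{tr}\{(X + tZ)^T A (X + tZ)\} - \operatorname{tr}\{X^T A X\} \tag{1} \\
&= \left(\operatorname{tr}\{Z^T A X\} + \operatorname{tr}\{X^T A Z\}\right) t + \left(\operatorname{tr}\{Z^T A Z\}\right) t^2. \tag{2}
\end{align}

Derivatives represent linear approximations, and (2) shows the derivative of $f$
at $X$ in the direction $Z$ is $\operatorname{tr}\{Z^T A X\} + \operatorname{tr}\{X^T A Z\}$.
The meaning may not be clear yet, but the calculation was simple!

The mapping $Z \mapsto \operatorname{tr}\{Z^T A X\} + \operatorname{tr}\{X^T A Z\}$
is linear: if it sends $Z_1$ to $c_1$ and $Z_2$ to $c_2$ then it sends
$\alpha Z_1 + \beta Z_2$ to $\alpha c_1 + \beta c_2$ for $\alpha, \beta \in \mathbb{R}$.
This linear mapping is the (Fréchet) derivative of $f$.

\begin{align}
Df(X) \cdot Z &= \operatorname{tr}\{Z^T A X\} + \operatorname{tr}\{X^T A Z\} \tag{3} \\
&= \operatorname{tr}\{Z^T (A + A^T) X\}. \tag{4}
\end{align}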

The Jacobian matrix can be read off as $(A + A^T) X$.
Treating $Z$ as a constant and differentiating (4) gives

\begin{equation}
(D^2 f(X) \cdot Z) \cdot T = \operatorname{tr}\{Z^T (A + A^T) T\}. \tag{5}
\end{equation}

The Hessian is $(A + A^T)$. The left-hand side of (5) is more commonly
written as $D^2 f(X) \cdot (Z, T)$.
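
The formulas (4) and (5) can be sanity-checked numerically. The following Python sketch compares them with central finite differences; the particular matrices and all helper names are illustrative assumptions, not part of the derivation.

```python
# Finite-difference check of (4) and (5) for f(X) = tr{X^T A X}.
# The matrices A, X, Z, T and the helper names are arbitrary illustrative choices.

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    return [list(row) for row in zip(*P)]

def add(P, Q, s=1.0):
    """Return P + s*Q element-wise."""
    return [[p + s * q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]

def trace(P):
    return sum(P[i][i] for i in range(len(P)))

def f(X, A):
    return trace(matmul(transpose(X), matmul(A, X)))

A = [[1.0, 2.0], [0.0, -1.0]]
X = [[0.5, 1.0], [-2.0, 0.3]]
Z = [[1.0, 0.0], [0.5, -1.0]]
T = [[0.2, -0.7], [1.5, 0.4]]
S = add(A, transpose(A))  # A + A^T

def Df(X, Z):
    """Equation (4): the derivative of f at X in the direction Z."""
    return trace(matmul(transpose(Z), matmul(S, X)))

t = 1e-6
# Central difference of f along Z, to be compared with (4).
first_fd = (f(add(X, Z, t), A) - f(add(X, Z, -t), A)) / (2 * t)
# Central difference of X -> Df(X).Z along T, to be compared with (5).
second_fd = (Df(add(X, T, t), Z) - Df(add(X, T, -t), Z)) / (2 * t)
second_exact = trace(matmul(transpose(Z), matmul(S, T)))

print(first_fd, Df(X, Z))
print(second_fd, second_exact)
```

Because $f$ is quadratic, the central differences agree with the closed forms up to rounding error.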

3 First-Order Derivatives and Gradients 

The definition $f'(x) = \lim_{t \to 0} t^{-1} [f(x + t) - f(x)]$ of the derivative
of a function $f : \mathbb{R} \to \mathbb{R}$ extends in several ways to functions
$f : U \to V$ between finite-dimensional vector spaces $U$ and $V$. The reader
may take, for concreteness, $U$ and $V$ to be scalars $\mathbb{R}$, vectors
$\mathbb{R}^n$ or matrices $\mathbb{R}^{n \times m}$.

One extension considers directional derivatives so as to reduce to the case
$g : \mathbb{R} \to V$, $g(t) = f(x + tz)$ for fixed $x, z \in U$, for which the
same formula as above can be used:

\begin{equation}
D_z f(x) = \lim_{t \to 0} \frac{f(x + tz) - f(x)}{t}. \tag{6}
\end{equation}

If the limit exists for all $z$ then (6) is called the Gâteaux derivative of $f$ at $x$.

Another extension looks beyond (6) and focuses on the geometric meaning of
$f'(x)$ as the gradient of the line of best fit to the graph of $f$ at $x$. This
suggests defining the derivative as the best linear approximation of $f$ at $x$.
Precisely, fix $x$ and assume there exists a linear function $A_x(z)$ such that

\begin{equation}
\lim_{z \to 0} \frac{\| f(x + z) - f(x) - A_x(z) \|}{\| z \|} = 0. \tag{7}
\end{equation}

Then $A_x$ is unique and is called the Fréchet derivative of $f$ at $x$, denoted
$Df(x)$. Sometimes, evaluation in a particular direction is denoted using a dot,
as in (3). That is, $Df(x) \cdot z = A_x(z)$.

The limit in (7) must exist for any sequence $\{z_n\}_{n=1}^{\infty}$ with
$z_n \to 0$. Even if the mapping $z \mapsto D_z f(x)$ in (6) is linear for a
fixed $x$, the Fréchet derivative need not exist because it is possible for (7)
to hold for sequences $z_n$ converging to the origin along straight lines but
not for sequences following certain curved trajectories. This occurs when the
limit is not uniform across straight lines: convergence to zero is fast along
some lines but arbitrarily slow along others.
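
A concrete instance of this phenomenon can be explored numerically. The sketch below uses the standard textbook function $f(x, y) = x^3 y / (x^6 + y^2)$ with $f(0, 0) = 0$, which is an assumption chosen for illustration and does not appear in the article. All of its directional derivatives at the origin are zero, so the candidate Fréchet derivative is the zero map, yet the quotient in (7) fails to vanish along the curve $y = x^3$.

```python
# Illustrative example (not from the article): every directional derivative of f
# at the origin is zero, but f is not Frechet differentiable there.

def f(x, y):
    if x == 0.0 and y == 0.0:
        return 0.0
    return x**3 * y / (x**6 + y**2)

def ratio(x, y):
    """The quotient in (7) at the origin, with the zero map as candidate derivative."""
    return abs(f(x, y)) / (x * x + y * y) ** 0.5

steps = (1e-1, 1e-2, 1e-3)

# Along the straight line z = t * (1, 2) the quotient tends to zero.
line = [ratio(t, 2 * t) for t in steps]

# Along the curve z = (t, t^3) the quotient grows, because f(t, t^3) = 1/2.
curve = [ratio(t, t**3) for t in steps]

print("straight line:", line)
print("curved path:  ", curve)
```

The straight-line values shrink roughly in proportion to $t$, whereas the curved-path values grow roughly like $1/(2t)$.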

An expedient technique for calculating Fréchet derivatives is guess-then-verify.
Verification is unnecessary if the Fréchet derivative is known to exist by
other means. In Section 2, $f$ is a polynomial, hence the Fréchet derivative
exists and can be found using directional derivatives, either explicitly as in
(2) or, in more complicated situations, by using truncated Taylor series
approximations. Of course, tables and rules could be used instead.

If $f : U \to \mathbb{R}$ is a scalar function then its gradient at $x$ is
defined with respect to an inner product. This is often forgotten because the
Euclidean inner product is chosen without mention in many textbooks. In matrix
space, the Euclidean inner product is $\langle A, B \rangle = \operatorname{tr}\{B^T A\}$.
For a fixed matrix $G$, $A(Z) = \langle G, Z \rangle$ is a linear functional, and
every linear functional can be written this way. The gradient of $f$ at $X$ is
the matrix $G_X$ such that $Df(X) \cdot Z = \langle G_X, Z \rangle$.
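
For the running example $f(X) = \operatorname{tr}\{X^T A X\}$, equation (4) gives $Df(X) \cdot Z = \langle (A + A^T) X, Z \rangle$, so the gradient with respect to the Euclidean inner product is $G_X = (A + A^T) X$. The Python sketch below compares this matrix with element-wise partial derivatives obtained by central differences; the matrices and helper names are illustrative assumptions.

```python
# Compare G_X = (A + A^T) X with numerically computed partial derivatives of
# f(X) = tr{X^T A X}. Matrices and helper names are illustrative choices.

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    return [list(row) for row in zip(*P)]

def f(X, A):
    M = matmul(transpose(X), matmul(A, X))
    return sum(M[i][i] for i in range(len(M)))

A = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0], [4.0, 0.5, 2.0]]   # 3 x 3
X = [[0.5, 1.0], [-2.0, 0.3], [1.5, -0.8]]                 # 3 x 2
n, m = len(X), len(X[0])

S = [[A[i][j] + A[j][i] for j in range(n)] for i in range(n)]  # A + A^T
G = matmul(S, X)                                               # claimed gradient

def partial(i, j, t=1e-6):
    """Central-difference approximation of the partial derivative df/dX_ij."""
    Xp = [row[:] for row in X]
    Xm = [row[:] for row in X]
    Xp[i][j] += t
    Xm[i][j] -= t
    return (f(Xp, A) - f(Xm, A)) / (2 * t)

J = [[partial(i, j) for j in range(m)] for i in range(n)]
max_err = max(abs(J[i][j] - G[i][j]) for i in range(n) for j in range(m))
print(max_err)
```

A different inner product on the matrix space would yield a different gradient matrix for the same derivative $Df(X)$.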



4 Second-order Derivatives and Hessians 

The Fréchet derivative of $f : U \to V$ is $Df : U \to L(U; V)$ where $L(U; V)$
is the vector space of all linear maps from $U$ to $V$. Applying $D$ to $Df$
yields the second-order derivative $D^2 f : U \to L(U; L(U; V))$. A second-order
derivative requires not one, but two, directions: $(D^2 f(X) \cdot T) \cdot Z$.
The right-hand side of (11) interprets this as the rate of change in the
direction $T$ of the directional derivative $Df(X) \cdot Z$.

To the letter of the law, $D^2 f(X)$ is calculated from (4) as follows. Working
directly with $Df(X) \cdot Z$ is not allowed because $Df(X)$ must be treated as
an element of $L(U; V)$ when computing $D^2 f(X) \cdot T = D(Df)(X) \cdot T$. By
assuming the Fréchet derivative exists, it suffices to work with directional
derivatives:

\begin{equation}
D^2 f(X) \cdot T = \lim_{t \to 0} \frac{Df(X + tT) - Df(X)}{t}. \tag{8}
\end{equation}

For clarity, let $L_t = Df(X + tT) \in L(U; V)$. For fixed $t$, both $L_t - L_0$
and $(L_t - L_0) t^{-1}$ are linear operators in $L(U; V)$. The vector space
structure on $L(U; V)$ is that induced by pointwise operations:
$(L_t - L_0) t^{-1}$ evaluated at $Z$ is $(L_t \cdot Z - L_0 \cdot Z) t^{-1}$ by
definition. A sequence of linear operators converges if and only if it converges
pointwise (throughout, all vector spaces are finite-dimensional). Thus, the
right-hand side of (8) can be determined pointwise:

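\begin{align}
\left( \lim_{t \to 0} \frac{Df(X + tT) - Df(X)}{t} \right) \cdot Z
  &= \lim_{t \to 0} \frac{Df(X + tT) \cdot Z - Df(X) \cdot Z}{t} \tag{9} \\
  &= \operatorname{tr}\{Z^T (A + A^T) T\}. \tag{10}
\end{align}

In words, $D(Df)(X) \cdot T$ is the linear operator
$Z \mapsto \operatorname{tr}\{Z^T (A + A^T) T\}$.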

A nominally different quantity is the derivative $Dg(X) \cdot T$ where
$g(X) = Df(X) \cdot Z$ for a fixed $Z$. Nevertheless,
$Dg(X) \cdot T = \operatorname{tr}\{Z^T (A + A^T) T\}$, the same as (10). Indeed,
the pointwise vector space structure on $L(U; V)$ means

\begin{equation}
(D^2 f(X) \cdot T) \cdot Z = (D(Df)(X) \cdot T) \cdot Z = D(Df \cdot Z)(X) \cdot T. \tag{11}
\end{equation}

Therefore $D^2 f$ can be computed from $Df(X) \cdot Z$ by treating $Z$ as a
constant and differentiating with respect to $X$. This is how (5) is obtained
from (4).

The above notation is simple but cumbersome. Textbooks generally drop
the variables, writing the chain rule and product rule as

\begin{align}
D(f \circ g) &= (Df \circ g)\, Dg, \tag{12} \\
D(fg) &= (Df)\, g + f\, Dg. \tag{13}
\end{align}

Direct application can lead to confusion though:

\begin{align}
D^2(f \circ g) &= D((Df \circ g)\, Dg) \tag{14} \\
&= (D(Df \circ g))\, Dg + (Df \circ g)\, D^2 g \tag{15} \\
&= ((D^2 f \circ g)\, Dg)\, Dg + (Df \circ g)\, D^2 g. \tag{16}
\end{align}

Taken literally, it is nonsense to multiply $D^2 f \circ g$ with $Dg$ twice. Only
with experience can $(D^2(f \circ g)(X) \cdot T) \cdot Z$ be deduced from (16).
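Including directions from the start reveals

\begin{align}
D(f \circ g) \cdot Z &= (Df \circ g) \cdot (Dg \cdot Z), \tag{17} \\
D(f \cdot (g \cdot Z)) \cdot T &= (Df \cdot T) \cdot (g \cdot Z) + f \cdot ((Dg \cdot T) \cdot Z), \tag{18} \\
(D^2(f \circ g) \cdot T) \cdot Z &= ((D^2 f \circ g) \cdot (Dg \cdot T)) \cdot (Dg \cdot Z)
  + (Df \circ g) \cdot ((D^2 g \cdot T) \cdot Z). \tag{19}
\end{align}

Here, $X$ is omitted because it is simple enough to feed it in to the terms
requiring it. To be clear, $Df \circ g$ means evaluate $Df$ at $g(X)$.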
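
Formula (19) can be checked numerically on a small example. In the Python sketch below, the maps $g : \mathbb{R}^2 \to \mathbb{R}^2$ and $f : \mathbb{R}^2 \to \mathbb{R}$, their hand-computed derivatives, and the points and directions are all hypothetical choices made for illustration. The left-hand side is approximated by a central mixed difference of $f \circ g$.

```python
# Numerical check of (19) for hypothetical maps g : R^2 -> R^2 and f : R^2 -> R.

def g(x, y):
    return (x * y, x + y * y)

def Dg(x, y, a, b):
    """Dg(x, y) . (a, b)"""
    return (y * a + x * b, a + 2 * y * b)

def D2g(a, b, c, d):
    """(D^2 g . (a, b)) . (c, d); independent of (x, y) because g is quadratic."""
    return (a * d + b * c, 2 * b * d)

def f(u, v):
    return u * u * v + v ** 3

def Df(u, v, p, q):
    """Df(u, v) . (p, q)"""
    return 2 * u * v * p + (u * u + 3 * v * v) * q

def D2f(u, v, p, q, r, s):
    """(D^2 f(u, v) . (p, q)) . (r, s)"""
    return 2 * v * p * r + 2 * u * (p * s + q * r) + 6 * v * q * s

def h(x, y):
    return f(*g(x, y))

X, T, Z = (0.7, -0.4), (1.0, 0.5), (-0.3, 2.0)

# Right-hand side of (19): both directions pass through Dg in the first term.
u, v = g(*X)
rhs = D2f(u, v, *Dg(*X, *T), *Dg(*X, *Z)) + Df(u, v, *D2g(*T, *Z))

# Left-hand side of (19) via a central mixed difference in the directions T and Z.
s = 1e-4

def shift(a, b):
    return (X[0] + s * (a * T[0] + b * Z[0]), X[1] + s * (a * T[1] + b * Z[1]))

lhs = (h(*shift(1, 1)) - h(*shift(1, -1))
       - h(*shift(-1, 1)) + h(*shift(-1, -1))) / (4 * s * s)

print(lhs, rhs)
```

The two printed values should agree to several decimal places; the residual is the truncation and rounding error of the finite difference.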

Neither approach is particularly friendly. The former omits important details
while the latter is tedious; the reader is invited to derive (19) from either
(17) and (18), or from (17) and

\begin{equation}
D(fg) \cdot Z = (Df \cdot Z)\, g + f\, (Dg \cdot Z). \tag{20}
\end{equation}

For scalar fields $f : U \to \mathbb{R}$, the unique linear operator $H_X$
satisfying $(D^2 f(X) \cdot T) \cdot Z = \langle H_X \cdot T, Z \rangle$ is the
Hessian of $f$ at $X$. The ordering is unimportant because $D^2 f(X)$ is
symmetric: $(D^2 f(X) \cdot T) \cdot Z = (D^2 f(X) \cdot Z) \cdot T$ for all $Z$
and $T$. When the Euclidean inner product is used, $H_X$ agrees with what is
called the Hessian matrix [13].



5 A Tensor Product Notation for Derivatives 

Given $f : U \to L(V; W)$ and $g : U \to L(Y; V)$, $Df$ maps into
$L(U; L(V; W))$ whereas $g$ maps into $L(Y; V)$, indicating technically the
product $(Df)\, g$ cannot be formed. The tensor product allows replacing
$L(U; L(V; W))$ by $L(U \otimes V; W)$. Equation (13) becomes

\begin{equation}
D(fg) = (Df)(I \otimes g) + f\, Dg \tag{21}
\end{equation}

where $I$ is the identity map. This has the correctness of (20) and almost the
same brevity as (13).

The obvious role of the tensor product is directing variables to their correct
targets: the $g$ in $(Df)\, g$ blocks $Z$ from reaching $Df$ when applied on the
right, while the $I$ in $Df\, (I \otimes g)$ allows the $Z$ through. Although the
direct sum would serve equally well in this role, it is the tensor product that
behaves correctly under differentiation:

\begin{equation}
D(f \otimes g) = (Df \otimes g) + (f \otimes Dg). \tag{22}
\end{equation}

In particular, (21) can be differentiated again by using
$D(I \otimes g) = (DI \otimes g) + (I \otimes Dg) = (0 \otimes g) + (I \otimes Dg) = I \otimes Dg$.

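If $f$ is itself a derivative then (12) becomes

\begin{equation}
D(f \circ g) = (Df \circ g)(Dg \otimes I). \tag{23}
\end{equation}

Following these rules gives

\begin{align}
D^2(f \circ g) &= (D^2 f \circ g)(Dg \otimes I)(I \otimes Dg) + (Df \circ g)\, D^2 g \tag{24} \\
&= (D^2 f \circ g)(Dg \otimes Dg) + (Df \circ g)\, D^2 g, \tag{25} \\
D^3(f \circ g) &= (D^3 f \circ g)(Dg \otimes Dg \otimes Dg) \notag \\
&\quad + (D^2 f \circ g)\left[ (D^2 g \otimes Dg) + 2 (Dg \otimes D^2 g) \right] + (Df \circ g)\, D^3 g. \tag{26}
\end{align}

The remainder of this section gives the intermediate steps. Section 6 presents a
formal description of the notation.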

Start with $D(f \circ g) = (Df \circ g)\, Dg$. Differentiate to get
$D(Df \circ g)\, (I \otimes Dg) + (Df \circ g)\, D^2 g$. This time, (23) is
required: $D(Df \circ g) = (D^2 f \circ g)(Dg \otimes I)$. Tensor products of
linear maps satisfy the rule $(A \otimes B)(C \otimes D) = (AC \otimes BD)$.
Therefore, $(Dg \otimes I)(I \otimes Dg) = Dg \otimes Dg$.
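
The rule for composing tensor products of linear maps can be checked with matrices, since the Kronecker product is a matrix representation of the tensor product of linear maps. The following Python sketch uses exact integer arithmetic; the matrices are arbitrary illustrative choices, and `G` merely stands in for the matrix of $Dg$ at a point.

```python
# Exact integer check of (A (x) B)(C (x) D) = (AC) (x) (BD), and of the special
# case (Dg (x) I)(I (x) Dg) = Dg (x) Dg, with Kronecker products of matrices.

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def kron(P, Q):
    """Kronecker product of matrices stored as lists of rows."""
    return [[p * q for p in rp for q in rq] for rp in P for rq in Q]

def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]

A = [[1, 2, 0], [-1, 3, 4]]      # 2 x 3
C = [[2, 1], [0, -2], [5, 3]]    # 3 x 2
B = [[1, -1], [2, 0]]            # 2 x 2
D = [[3], [-4]]                  # 2 x 1

mixed_lhs = matmul(kron(A, B), kron(C, D))
mixed_rhs = kron(matmul(A, C), matmul(B, D))

# G plays the role of Dg at a point, a linear map from R^3 to R^2.
G = [[1, 2, -1], [0, 3, 4]]
special_lhs = matmul(kron(G, identity(2)), kron(identity(3), G))
special_rhs = kron(G, G)

print(mixed_lhs == mixed_rhs, special_lhs == special_rhs)
```

Note that the two identity matrices in the special case have different sizes, one acting on the codomain of `G` and the other on its domain.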

To obtain (26), first apply the product rule (21) to the two additive terms
in (25). Note $D(Dg \otimes Dg) = (D^2 g \otimes Dg) + (Dg \otimes D^2 g)$. At
this point,

\begin{align}
D^3(f \circ g) &= (D^3 f \circ g)(Dg \otimes I)(I \otimes (Dg \otimes Dg)) \notag \\
&\quad + (D^2 f \circ g)\left[ (D^2 g \otimes Dg) + (Dg \otimes D^2 g) \right] \tag{27} \\
&\quad + (D^2 f \circ g)(Dg \otimes I)(I \otimes D^2 g) + (Df \circ g)\, D^3 g. \notag
\end{align}

The first $I$ in (27) acts on $U \otimes U$ whereas the second acts on $U$.
Regardless, it is agreeable to equate $(Dg \otimes I)(I \otimes (Dg \otimes Dg))$
with $Dg \otimes (Dg \otimes Dg) = Dg \otimes Dg \otimes Dg$, and (26) readily
follows from (27).
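
In the scalar case $U = V = W = \mathbb{R}$ every tensor product reduces to an ordinary product, so (25) and (26) become the familiar second- and third-order chain rules. The Python sketch below checks both exactly with integer polynomial arithmetic; the polynomials $f$ and $g$ and the helper names are arbitrary illustrative choices.

```python
# Exact check of (25) and (26) in the scalar case, where tensor products reduce
# to ordinary products. Polynomials are coefficient lists, lowest degree first.

def padd(p, q):
    n = max(len(p), len(q))
    return [(p[i] if i < len(p) else 0) + (q[i] if i < len(q) else 0)
            for i in range(n)]

def pmul(p, q):
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def pderiv(p, k=1):
    """k-th derivative of the polynomial p."""
    for _ in range(k):
        p = [i * c for i, c in enumerate(p)][1:] or [0]
    return p

def peval(p, x):
    acc = 0
    for c in reversed(p):
        acc = acc * x + c
    return acc

def pcompose(p, q):
    """Coefficients of p(q(x)), by Horner's scheme."""
    out = [0]
    for c in reversed(p):
        out = padd(pmul(out, q), [c])
    return out

f = [0, 0, 1, 0, 2]   # f(u) = u^2 + 2 u^4
g = [1, -3, 0, 1]     # g(x) = 1 - 3 x + x^3
h = pcompose(f, g)    # h = f o g

def rhs2(x):
    u = peval(g, x)
    g1, g2 = peval(pderiv(g), x), peval(pderiv(g, 2), x)
    return peval(pderiv(f, 2), u) * g1 * g1 + peval(pderiv(f), u) * g2

def rhs3(x):
    u = peval(g, x)
    g1, g2, g3 = (peval(pderiv(g, k), x) for k in (1, 2, 3))
    f1, f2, f3 = (peval(pderiv(f, k), u) for k in (1, 2, 3))
    # (D^2 g (x) Dg) + 2 (Dg (x) D^2 g) collapses to 3 g' g'' for scalars.
    return f3 * g1 ** 3 + f2 * (g2 * g1 + 2 * g1 * g2) + f1 * g3

checks = [(peval(pderiv(h, 2), x) == rhs2(x), peval(pderiv(h, 3), x) == rhs3(x))
          for x in range(-3, 4)]
print(all(a and b for a, b in checks))
```

The scalar case cannot distinguish $D^2 g \otimes Dg$ from $Dg \otimes D^2 g$, which is precisely the ordering information that the tensor product notation retains in higher dimensions.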

6 Formal Description 

The tensor product notation used in Section 5 is stated formally below. Some
intricacies appear, but go unnoticed in practice. The tensor product is generally
not required when differentiating a given function; cf. Section 2. It can simplify
the differentiation by hand of abstract expressions, such as when seeking bounds
like the one in (38).

All spaces are finite-dimensional vector spaces. Basic properties of tensor
products are used [19]. The main principle is that canonical isomorphisms of
vector spaces can be applied freely because they essentially commute with the
Fréchet derivative.

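Given $f : U \to V$, define $\mathcal{D}^k f : U \to L(U \otimes \cdots \otimes U; V)$ by

\begin{equation}
\mathcal{D}^k f(X) \cdot (Z_1 \otimes \cdots \otimes Z_k) = ((D^k f(X) \cdot Z_1) \cdots) \cdot Z_k. \tag{28}
\end{equation}

For $g : U \to L(V; W)$, define
$\mathcal{D}^k_V(g) : U \to L(U \otimes \cdots \otimes U \otimes V; W)$ by

\begin{equation}
\mathcal{D}^k_V(g)(X) \cdot (Z_1 \otimes \cdots \otimes Z_k \otimes T) = (((D^k g(X) \cdot Z_1) \cdots) \cdot Z_k) \cdot T. \tag{29}
\end{equation}

Although $\mathcal{D}^2 f \neq D(Df)$, they agree up to a canonical linear
isomorphism. In fact, $\mathcal{D}^k f = \mathcal{D}^{k-1}_U(Df)$. For all intents
and purposes, $\mathcal{D}^2_V(g)$ agrees with
$\mathcal{D}_{U \otimes V}(\mathcal{D}_V(g))$ because applying the canonical
identification $U \otimes (U \otimes V) = U \otimes U \otimes V$ in practice
simply means omitting a pair of brackets.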

Given $g : U \to L(V; W)$ and $h : U \to L(W; Y)$, the product rule is

\begin{equation}
\mathcal{D}_V(hg) = \mathcal{D}_W(h)\, (I_U \otimes g) + h\, \mathcal{D}_V(g) \tag{30}
\end{equation}

where $I_U : U \to U$ is the identity map and $I_U \otimes g$ is a tensor field
over $U$ whose value at $X \in U$ is
$I_U \otimes (g(X)) \in L(U \otimes V; U \otimes W)$.

Related is the application of a linear map to a vector; given $f : U \to W$ then

\begin{equation}
D(h \cdot f) = \mathcal{D}_W(h)\, (I_U \boxtimes f) + h\, Df \tag{31}
\end{equation}

where $(I_U \boxtimes f)(X)$ is the linear map $Z \mapsto (Z \otimes f(X))$.
Later, by minor abuse of notation, $\otimes$ will replace $\boxtimes$. Since
$L(U; V) \boxtimes W = L(U; V \otimes W) = L(U; V) \otimes W$, both $\boxtimes$
and $\otimes$ behave essentially the same way when differentiated.

For $d : V \to L(W; Y)$, $e : U \to V$ and $f : V \to W$, the two chain rules are

\begin{align}
D(f \circ e) &= (Df \circ e)\, De, \tag{32} \\
\mathcal{D}_W(d \circ e) &= (\mathcal{D}_W(d) \circ e)(De \otimes I_W) \tag{33}
\end{align}

where $I_W : W \to W$ is the identity map.

For $e : U \to V$ and $f : U \to W$, the tensor product rule is

\begin{equation}
D(e \otimes f) = (De \boxtimes f) + (e \boxtimes Df) \tag{34}
\end{equation}

where $\boxtimes$ combines a vector and a linear map to form a linear map, as in
(31). For $g : U \to L(V; W)$ and $h : U \to L(C; Y)$, the tensor product rule is

\begin{equation}
\mathcal{D}_{V \otimes C}(g \otimes h) = (\mathcal{D}_V(g) \otimes h) + (g \otimes \mathcal{D}_C(h)). \tag{35}
\end{equation}

For $e : U \to V$ and $h : U \to L(C; Y)$,

\begin{equation}
\mathcal{D}_C(e \boxtimes h) = (De \otimes h) + (e \boxtimes \mathcal{D}_C(h)). \tag{36}
\end{equation}



If the codomains of all functions are spaces of linear maps then the situation
is particularly simple; (30), (33) and (35) suffice. This is the typical situation
when computing higher-order derivatives because the codomain of the derivative
of a function is a space of linear maps. It is possible to reduce to this situation
by replacing $f : U \to W$ with $\tilde{f} : U \to L(\mathbb{R}; W)$ where
$f(X) = \tilde{f}(X) \cdot 1$. The $\cdot 1$ can be removed, the derivatives
calculated, and the $\cdot 1$ applied at the very end. This explains the
similarity of (30) and (31), and of (34), (35) and (36).

In practice, it is easier to replace $\boxtimes$ by $\otimes$ than replace $f$ by
$\tilde{f}$. No confusion arises because $\boxtimes$ and $\otimes$ behave the same
way with respect to addition, multiplication and differentiation.

The subscripts on $\mathcal{D}$ used in (30)-(36) merely keep all derivatives in a
consistent form and can be dropped. When computing higher-order derivatives
recursively, to account for $\mathcal{D}^k$ differing from
$\mathcal{D}\mathcal{D}^{k-1}$ by a linear isomorphism, it is only necessary to
remove any remaining brackets in tensor products at the end of each step, e.g.,
replace $Dg \otimes (Dg \otimes Dg)$ by $Dg \otimes Dg \otimes Dg$ in (27).

Once $\boxtimes$ is replaced by $\otimes$ and the subscripts dropped on
$\mathcal{D}$, the rules collapse to those in Section 5.

7 Discussion 

Section 6 elicited the relationship between $D$ and $\mathcal{D}$. This viewpoint
highlights the usefulness of the tensor product as a mathematical operation and
validates its use when computing bounds on the (operator) norms of derivatives,
e.g.,

\begin{align}
\| D^2(f \circ g) \| &\le \| D^2 f \circ g \| \, \| Dg \otimes Dg \| + \| Df \circ g \| \, \| D^2 g \| \tag{37} \\
&\le \| D^2 f \circ g \| \, \| Dg \|^2 + \| Df \circ g \| \, \| D^2 g \|. \tag{38}
\end{align}

Alternatively, a mechanical calculus can be developed, where $\otimes$ is a formal
symbol used to direct variables to their correct targets. This would follow the
course of building a class $\Omega$ of allowable expressions, explaining how $D$
is applied to members of this class, and verifying the class is algorithmically
closed under $D$. This mechanical viewpoint could be used to write trees of
nonlinear functions of multiple arguments as a serial composition of functions of
a single argument; the utility is questionable though.

Fréchet derivatives are defined on Banach spaces. The aforementioned mechanical
viewpoint implies the tensor product notation remains applicable. Depending on
the application, a subtlety is that tensor products are not uniquely defined on
Banach spaces; different choices of norms, and hence completions with respect to
that norm, are possible [17].

8 Conclusion

The presentation aimed to complement traditional texts on matrix differential
calculus. The $Df$ notation is convenient for differentiating given functions
(Section 2) but has its subtleties when differentiating abstract expressions
(Section 4). Tensor products provide a notational convenience that simplifies
certain calculations (Section 5). This is pedagogically interesting as an
elementary yet genuine application of the tensor product.